How can Social Media Sites like Twitter be used to predict future outcomes like elections?
'''1.1 Introduction'''

With 554,750,000 active registered users, 135,000 new users signing up every day, and 58 million tweets per day, Twitter clearly holds a huge amount of user-generated data. This content contains a diverse range of political insight. But can this data be used to predict real-world outcomes like election results? To what extent is it representative of the whole population of interest (the part of the population for whom the topic is relevant)? What are the advantages and disadvantages of using social media to predict future outcomes? How relevant are the results? And can these prediction methods replace traditional methods such as market surveys and polls?

We start by looking at the existing methodologies for measuring the beliefs and intentions of individuals, for example market surveys and polls. These have a number of disadvantages in terms of cost, time and human effort. Thanks to social media sites like Twitter and Facebook, however, a large amount of information about almost every topic and major event around the world is now available online in the form of people's sentiments. The intuitive idea is to use this user-generated information to produce the results that are traditionally published using surveys and polls.

This is a challenging task. We need to ensure that our sample has a representative distribution, and we have to take into account the sentiment in the user-generated content. Amid these difficulties, we have one feature in our favor: the '''Wisdom of Crowds'''. This concept tells us that when we average the individual estimates first, the error can be reduced by a factor as large as the crowd itself. But the Wisdom of Crowds alone is not enough; many other factors matter as well.
We have to take into account the following factors:
* Appropriate '''Sampling''' approaches (to make sure the sample population is representative)
* Methods of modeling '''Political Sentiment'''
* Incorporating the notion of Political Sentiment in our '''Prediction Model'''

Once we have developed the model, we compare its predictions with the traditional polls and the actual election outcomes. In scenarios like result forecasting, the accuracy of the underlying prediction model is judged by the final result rather than by the continuous time-series data. There are concerns about using Twitter as a reliable source of data; the two major ones are the inability to determine a representative sample accurately and the potential for deliberately influencing the results through spamming and similar tactics. Hence, these predictive schemes are both promising and challenging.

'''1.2 Short Answer'''

To further our understanding of election prediction using Twitter, we focus on the '''2011 Irish General Election''' and model political sentiment by mining social media. The election was held on '''February 25th, 2011'''. There were five major political parties: '''Fianna Fail (FF), The Green Party, Labour, Fine Gael (FG) and Sinn Fein (SF)'''. Between February 8th and February 25th, 32,578 tweets relevant to these five parties were collected. The relevant tweets were identified using the parties' names and abbreviations, and the election hashtag '''#ge11'''.

The prediction error is measured in terms of '''Mean Absolute Error (MAE)''', i.e. the average of the absolute errors across the individual forecasts. MAE measures the deviation of the predicted values from the actual values. It is used to measure the error between Twitter predictions and actual results, as well as between Twitter predictions and polls.
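To make the MAE definition concrete, here is a minimal Python sketch. It assumes vote shares are expressed as percentages keyed by party; the function name `mean_absolute_error` and all of the numbers are illustrative placeholders, not the actual 2011 predictions, polls or results.

```python
def mean_absolute_error(predicted, actual):
    """Average of the absolute per-party differences between two sets of shares."""
    if predicted.keys() != actual.keys():
        raise ValueError("predicted and actual must cover the same parties")
    return sum(abs(predicted[p] - actual[p]) for p in actual) / len(actual)


# Hypothetical vote shares in percent (NOT real election or poll data).
predicted = {"FF": 20.0, "FG": 35.0, "Labour": 20.0, "SF": 15.0, "Green": 10.0}
actual = {"FF": 25.0, "FG": 40.0, "Labour": 15.0, "SF": 10.0, "Green": 10.0}

mae = mean_absolute_error(predicted, actual)
print(mae)  # four parties are off by 5 points, one by 0: 20 / 5 = 4.0
```

The same function can compare Twitter predictions against poll figures by passing the poll shares as the second argument.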
To predict the election results, we base our assumptions on two factors: '''volume and sentiment'''. It is logical to assume that the larger the volume of related content (number of tweets) for a given political party, the more votes that party will receive. The justification is that a large volume of content means the party attracts more attention and has more candidates, and hence more votes. This leads to the conclusion that larger parties will have a larger presence and a larger vote bank than smaller parties. But is this the correct measure of popularity? Volume can easily be affected by a few very prominent stories or by deliberate spamming, so the '''volume-based measure''' of popularity has to be defined carefully.

Our '''volume-based measure''' is the ratio of the number of tweets relevant to a given party to the total number of tweets relevant to all parties. It can be written as follows:

<math>SoV(x) = \frac{Rel(x)}{\sum_{i=1}^{N} Rel(i)}</math>

where SoV(x) is the Share of Volume for a given party x, N is the total number of parties (in our case, 5), and Rel(x) is the number of tweets relevant to party x. An advantage of this formula is that the Share of Volume values for all parties sum to one, which makes analysis and comparison easier.

We use different samples (to make sure that our sample is representative and to find the best sample of all). These are as follows:
*'''Time-Based''': The most recent tweets, from the last 24 hours, 3 days or 7 days.
*'''Sample-Size Based''': The most recent 1000, 2000, 5000 or 10000 tweets.
*'''Cumulative''': All tweets from February 8th to the relevant time.
*'''Manual''': Manually labeled tweets from February 8th.

The second aspect of prediction is '''Sentiment Analysis'''. Previous research has shown that supervised learning provides more accurate sentiment analysis than unsupervised methods, so it is better to use trained classifiers for the elections.
Also, different trained annotators should be used, and the data should be collected over varied times (to make sure the sample is diverse enough). In our example, the annotation categories are as follows:
*'''Three Sentiment Classes''' (Positive, Negative, Mixed)
*'''One Non-Sentiment Class''' (Neutral)
*'''Three Other Classes''' (Unannotatable, Non-relevant, Unclear)

The three other classes (Unannotatable, Non-relevant and Unclear) are disregarded. The mixed annotations are also disregarded because they are few in number and ambiguous. Various sociolinguistic features such as emoticons and unconventional punctuation are also taken into account, because these features add tone to the text and are likely to add value to the sentiment classification. All topic terms, usernames and URLs are removed to make the classification as unbiased as possible.

After deciding how to classify the tweets by sentiment, the next step is to decide how to incorporate this sentiment into the prediction model. The sentiment distribution in the tweets for a given party indicates people's disposition towards that party. If the majority of the tweets have a negative sentiment, it is likely that people have a negative inclination towards that party. However, this holds only for a party in isolation, and we are considering all the political parties involved in the election at the same time. In a closed system like an election, the '''relative sentiment''' between different parties becomes much more important. To address the relative sentiment issue, we modify the SoV (Share of Volume) measure to represent the share of positive and negative volume as follows:

<math>SoV_p(x) = \frac{Pos(x)}{\sum_{i=1}^{N} Pos(i)}</math>

<math>SoV_n(x) = \frac{Neg(x)}{\sum_{i=1}^{N} Neg(i)}</math>

where <math>SoV_p</math> is the share of positive volume; <math>SoV_n</math> is the share of negative volume; Pos(x) is the number of tweets with a positive sentiment for a given party x; Neg(x) is the number of tweets with a negative sentiment for a given party x; and N is the total number of parties. This is the '''Inter Party Sentiment'''.
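The share-of-volume measures can be sketched in a few lines of Python. In each case a party's tweet count is divided by the total count over all parties, so one helper serves for overall, positive and negative volume. The helper name `share_of_volume` and the tweet counts are hypothetical, chosen only for illustration.

```python
def share_of_volume(counts):
    """Divide each party's tweet count by the total, so the shares sum to one."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tweets counted for any party")
    return {party: n / total for party, n in counts.items()}


# Hypothetical tweet counts per party (NOT the real #ge11 data).
relevant = {"FF": 200, "FG": 400, "Labour": 200, "SF": 150, "Green": 50}
positive = {"FF": 10, "FG": 50, "Labour": 20, "SF": 15, "Green": 5}
negative = {"FF": 80, "FG": 40, "Labour": 40, "SF": 30, "Green": 10}

sov = share_of_volume(relevant)    # overall share of volume, SoV
sov_p = share_of_volume(positive)  # share of positive volume
sov_n = share_of_volume(negative)  # share of negative volume

for party in relevant:
    print(party, sov[party], sov_p[party], sov_n[party])
```

Because every measure is normalized over the same set of parties, the values are directly comparable across parties within one measure.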
For '''Intra Party Sentiment''', we use a log-ratio sentiment, written here as Sent(x):

<math>Sent(x) = \log \frac{Pos(x)}{Neg(x)}</math>

where Pos(x) is the number of tweets with a positive sentiment for a given party x, and Neg(x) is the number of tweets with a negative sentiment for a given party x. This tells us how positive or negative the tweets are for a given topic. Its value is positive when the number of positive tweets is greater than the number of negative tweets, and negative when the number of negative tweets is greater than the number of positive tweets.

'''1.3 Long Answer'''